Introduction to Medical Statistics 2024
Exercises Class I
Data, Variables, Descriptive Statistics

Author

Ronald Geskus

Published

August 19, 2024

I. Characteristics of a numeric variable

  1. Days off at a mining plant. Workers at a particular mining site receive an average of 35 days of paid vacation, which is lower than the national average. The manager is under pressure to increase the amount of paid time off. However, he does not want to give more days off to the workers because that would be costly. Instead he decides to fire 10 employees in such a way as to raise the average number of days off that are reported by his employees. In order to achieve this goal, should he fire employees who have the largest number of paid vacation days, those with the smallest number, or those who have about the average number of days off?
  2. Infant mortality. The infant mortality rate is defined as the number of infant deaths per 1,000 live births. This rate is often used as an indicator of the level of health in a country. The frequency histogram below shows the distribution of estimated infant death rates in 2012 for 222 countries. The height of each bar depicts the number of countries in that range of mortality rates.

     1. Guess the first quartile and the median from the histogram.
     2. Would you expect the mean of this data set to be smaller or larger than the median? Explain your reasoning.
  3. Distributions and appropriate statistics. For each of the following, describe whether you expect the distribution to be approximately symmetric, right skewed, or left skewed. Also specify whether the mean or median would best represent a “typical” observation in the data, and whether the variability of observations would be better represented by the standard deviation or quartiles/IQR.
     1. DENV viremia in a data set where the first quartile of the values is at 350,000 copies/mL, the median at 450,000, the third quartile at 1,000,000 and 4.3% of individuals have more than 6,000,000 copies/mL.
     2. DENV viremia in a data set where the first quartile of the values is at 300,000 copies/mL, the median is at 600,000, the third quartile at 900,000 and 0.5% of individuals have more than 1,200,000 copies/mL.
     3. Weight distribution of adults living in Ho Chi Minh City.
     4. The number of days that individuals stay in ICU at HTD.
  4. Histograms and box plots. Compare the two plots below. What characteristics of the distribution are apparent in the histogram and not in the box plot? What characteristics are apparent in the box plot but not in the histogram?
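
One way to check your reasoning about skewness in the questions above is to look at how far the quartiles lie from the median, and to compare mean and median in a sample that is known to be skewed. The sketch below uses the quartiles quoted for the two DENV viremia data sets together with a simulated log-normal sample. The helper `quartile_gaps` and the simulated values are illustrative assumptions, not part of the course material.

```r
# Quartiles as stated in the two DENV viremia descriptions (copies/mL)
denv_a <- c(q1 = 350000, median = 450000, q3 = 1000000)
denv_b <- c(q1 = 300000, median = 600000, q3 = 900000)

# Hypothetical helper: distance from the median to each quartile.
# A much larger upper gap points to right skew; equal gaps point to
# symmetry in the middle half of the data.
quartile_gaps <- function(q) {
  c(lower = q[["median"]] - q[["q1"]],
    upper = q[["q3"]] - q[["median"]])
}

gaps_a <- quartile_gaps(denv_a)
gaps_b <- quartile_gaps(denv_b)
print(gaps_a)
print(gaps_b)

# Simulated right-skewed sample (log-normal), NOT real data
set.seed(1)
x <- rlnorm(1000, meanlog = 0, sdlog = 1)
skew_summary <- c(mean = mean(x), median = median(x), sd = sd(x), IQR = IQR(x))
print(round(skew_summary, 2))
```

In the simulated sample the long right tail pulls the mean above the median and inflates the standard deviation relative to the IQR.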


Today we will use a dataset that contains information on 201 patients with meningitis from 4 different patient groups, determined by whether the patients have tuberculous or cryptococcal meningitis and whether they have HIV coinfection. You can find a description of the variables in the file cmTbmData_description.txt.

II. Data import

  1. Download the dataset cmTbmData.csv and open it in MS Excel. Is the data set “tidy”? Do you agree with the naming of the columns? Describe the type of each of the variables. Save the dataset as an Excel Workbook file.
  2. Import the data set cmTbmDataWithErrors.csv. We purposely created 4 errors in this data set. Create a summary of the variables and see whether you can find them. Hint: look at the variables groupLong, sex, bldwcc and csfwcc.
  3. From now on we will use the data set without the errors. We import the data and show the first six rows.
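
The import and inspection steps above could look roughly as follows. This is a minimal sketch using base R's `read.csv`; the course may use a different import function. So that the code runs on its own, it first writes a small made-up file to a temporary location. The three columns, their values and the name `toy_path` are assumptions for illustration only.

```r
# Toy stand-in for cmTbmData.csv so the sketch runs on its own;
# all values below are made up.
toy_path <- tempfile(fileext = ".csv")
writeLines(c("age,sex,csfwcc",
             "34,1,120",
             "51,0,0",
             "27,1,15",
             "45,0,980"), toy_path)

# In the practical, replace toy_path with the path to "cmTbmData.csv"
d <- read.csv(toy_path)

str(d)       # storage type of each column
summary(d)   # ranges and missing values help to spot data entry errors
head(d)      # first six rows (the toy file has only four)
```

Impossible values, such as a negative cell count or an unexpected category label, usually show up in the minimum, the maximum or the list of categories in the summary.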

III. Numerical summaries and data transformations

  1. Summarize the variables age, white cell count in CSF (csfwcc) and sex. Do you think that the variables age and CSF white cell count have a skewed distribution? Have a look at the sex variable. The summary of the sex variable is probably not what you expect. What is the reason?

Make sex into a categorical variable via the factor function; use appropriate labels. Run the summary function again on the sex variable.
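
A minimal sketch of the conversion, assuming that sex is stored as a numeric code. The coding 0 = female, 1 = male used here is hypothetical; check cmTbmData_description.txt for the actual coding. If the variable is stored as text instead, `summary` reports only its length and class.

```r
# Toy data; the coding 0 = female, 1 = male is an assumption
d <- data.frame(sex = c(1, 0, 1, 1, 0))

summary(d$sex)   # numeric summary: minimum, quartiles, mean, maximum

# Convert to a categorical variable with readable labels
d$sex <- factor(d$sex, levels = c(0, 1), labels = c("female", "male"))

summary(d$sex)   # now a count per category
```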

  2. Create a new variable log10.csfwcc containing log10-transformed values of white cell count in CSF and add it to the dataset. Check whether the values of log10.csfwcc make sense by applying the summary function to that variable. Do you observe anything strange? If so, what do you think has happened?

How would you solve this problem? Try it out by changing the code above, and make a summary of the logarithm of white cell count in CSF again. Is the distribution of CSF white cell count less skewed after the logarithmic transformation?
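
The sketch below shows what a logarithm does to a count of zero, using made-up values rather than the course data. Adding 1 before taking the logarithm is one common workaround; it is an assumption here and not necessarily the solution intended in the course.

```r
csfwcc <- c(0, 5, 40, 300, 2500)   # made-up counts, including a zero

log10(csfwcc)            # the zero is mapped to -Inf
summary(log10(csfwcc))   # minimum and mean are affected by -Inf

# One possible workaround (an assumption): add 1 before the log,
# so that a count of 0 is mapped to 0
log10.csfwcc <- log10(csfwcc + 1)
summary(log10.csfwcc)
```

The added constant is arbitrary, so it is worth checking whether your conclusions change when a different constant is used.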

IV. Basic visual data summaries

Most of the figures in the practicals of this course make use of the ggplot2 package, but there are often similar base R plotting functions as well. We only provide answers in ggplot2. This package is based on the “Grammar of Graphics” philosophy, which is further explained in the course Principles of Data Visualization.

  1. In the previous section we produced a numerical summary of CSF white cell count. There was a clear suggestion that it had a skewed distribution, which became much more symmetric after we log-transformed the values. Now we see what we can learn from a histogram. We first need to “load” the ggplot2 package into our R session. Draw a histogram for white cell count in CSF (csfwcc). Vary the number of bins for the histogram (at the dots …) to see whether that changes the visual appearance. Choose the binwidth that gives a detailed visual representation without too much noise. What do you think is the role of the boundary argument?

Do the same for the log-transformed values. Can you tell from the figure whether the distribution of the CSF white cell count becomes less skewed after the log transformation?
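
A minimal sketch of both histograms, using simulated right-skewed counts in place of the course data. The bin widths, the simulated values and the object names `p_raw` and `p_log` are assumptions chosen for illustration; suitable bin widths for the real data will differ.

```r
library(ggplot2)

# Simulated right-skewed counts standing in for csfwcc (NOT the course data)
set.seed(2024)
d <- data.frame(csfwcc = round(rlnorm(200, meanlog = 4, sdlog = 1.5)))

# binwidth sets the width of each bar; boundary fixes where a bin edge
# falls (here at 0, so that no bar straddles zero)
p_raw <- ggplot(d, aes(x = csfwcc)) +
  geom_histogram(binwidth = 100, boundary = 0)

# Same plot for log10-transformed values; adding 1 is one way to handle zeros
p_log <- ggplot(d, aes(x = log10(csfwcc + 1))) +
  geom_histogram(binwidth = 0.25, boundary = 0)

print(p_raw)
print(p_log)
```

Rerunning the code with a few different `binwidth` values shows how a very small width produces a noisy picture and a very large width hides the shape of the distribution.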